[00:08.040 --> 00:16.080]  Hello, everyone. Good afternoon. I hope you're enjoying Devcon as well as I am. So my talk
[00:16.080 --> 00:22.820]  is about Androsia, which is a tool for securing data in process for your Android applications.
[00:23.140 --> 00:30.540]  So if I had to explain what Androsia is doing in a couple of lines, it's basically providing
[00:30.540 --> 00:39.580]  you the features of a proactive garbage collector. And garbage collectors inherently are lazy.
[00:39.580 --> 00:45.480]  So they only kick in when your application is memory constrained. And they will reclaim
[00:45.480 --> 00:52.440]  memory from objects which are unreferenced. But with Androsia, you will be able to reclaim
[00:52.440 --> 01:01.300]  memory from objects immediately after their last use. So this is helpful because if you
[01:01.300 --> 01:06.320]  have a lot of sensitive information lying around in your heap memory, as soon as it
[01:06.320 --> 01:12.500]  is last used, Androsia is going to kick in and collect all that sensitive information
[01:12.500 --> 01:18.160]  from your application and replace the memory contents with the default values. So your
[01:18.160 --> 01:23.380]  sensitive data will not remain in memory anymore. So before we begin, a little bit
[01:23.380 --> 01:29.080]  about myself. I'm Samit. I work for the product security team at Citrix. I'm a web and mobile
[01:29.080 --> 01:33.280]  application security enthusiast. Spoken at a bunch of conferences, including Black Hat
[01:33.280 --> 01:43.060]  Asia, AppSec USA, Cocon, CodeBlue, IEEE Services, MobileSoft, and CloudCom. So, yeah, a little
[01:43.060 --> 01:51.660]  bit of contact information in case you want to contact me. So starting off, the first
[01:51.660 --> 01:58.820]  question for you is, which one of these is the most difficult to protect? Number one
[01:58.820 --> 02:05.800]  data at rest, number two data in process, or number three data in motion? So arguably,
[02:05.800 --> 02:13.000]  I would say that it's number two. Why? Because the least number of solutions in the market
[02:13.000 --> 02:19.900]  are for number two. Because this is a very gray area. You know, EMM providers do not
[02:19.900 --> 02:27.540]  really understand how to protect data in process. Data at rest is very simple. You use any ciphers
[02:27.540 --> 02:32.680]  and then protect your data. Data in motion can be protected using TLS. However, data
[02:32.680 --> 02:42.980]  in process is still a mystery for most of us. So this tool targets number two. And we'll
[02:42.980 --> 02:53.060]  see how, yeah. So, going ahead. So Java creates a lot of objects, or Android itself creates
[02:53.240 --> 02:57.900]  a lot of objects on the heap, right? And the objects may contain a lot of sensitive information
[02:57.900 --> 03:03.820]  which may include your authentication credentials, authorisation tokens, encryption and decryption
[03:03.820 --> 03:11.560]  keys, PINs, OTPs, and personally identifiable information. So, as and when the application
[03:11.560 --> 03:17.760]  is executing, all of these will be on your heap memory. So the developers want to ensure
[03:17.760 --> 03:23.260]  that they get rid of all the sensitive information as soon as it is last used in the programme.
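Done by hand, "getting rid of the secret right after its last use" looks roughly like the sketch below. It is only an illustration of the goal, not code from the talk: the names `ManualWipe`, `authenticate`, and `loginAndWipe` are made up, and the secret is held in a `char[]` because an array can be overwritten in place.

```java
import java.util.Arrays;

// Hypothetical example: all names here are illustrative and not from the talk.
final class ManualWipe {

    // Stand-in for a real credential check; accepts one fixed PIN.
    static boolean authenticate(char[] pin) {
        return Arrays.equals(pin, new char[] {'4', '2', '4', '2'});
    }

    // Uses the PIN once, then overwrites it so the digits do not linger on the heap.
    static boolean loginAndWipe(char[] pin) {
        try {
            return authenticate(pin);   // last use of the secret
        } finally {
            Arrays.fill(pin, '\0');     // overwrite with the default char value
        }
    }
}
```

The `finally` block runs whether or not the check succeeds, which is the property a developer has to remember to preserve everywhere when doing this manually.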
[03:24.400 --> 03:29.760]  So the myth is that the garbage collector will eventually collect it. But, you know,
[03:29.760 --> 03:34.460]  the garbage collector is very lazy, as I said. It will not come into the picture, or it will
[03:34.460 --> 03:41.140]  not collect all your data until it is essential for it. Because unless your application is
[03:41.140 --> 03:46.540]  memory-constrained, it will not come into the picture. And even if it does, right, its
[03:46.540 --> 03:53.320]  scope is very limited. So this is something which is called an object tree on the heap.
[03:53.540 --> 04:00.200]  And the way a garbage collector actually does a collection is by using the mark-and-sweep algorithm.
[04:00.380 --> 04:08.000]  So it will start off from the root nodes which are called GC root. And from the GC root,
[04:08.000 --> 04:13.540]  it traverses the object tree using depth-first search and keeps on marking all the objects
[04:13.540 --> 04:18.760]  that are reachable. In the sweep phase of the Dalvik garbage collector's collection
[04:18.760 --> 04:24.520]  process, it is going to find out all the objects on the heap which are unreachable, and it
[04:24.520 --> 04:31.800]  is going to collect only those objects. So when a developer is, you know, writing code,
[04:31.800 --> 04:41.420]  he may forget to unreference some objects which may be these ones which are in the circle.
[04:42.220 --> 04:48.060]  And the garbage collector will not collect this set of objects just because they are reachable
[04:48.060 --> 04:53.500]  from the GC roots. And these objects may contain sensitive information and they're going to
[04:53.500 --> 04:59.260]  lie around in your application throughout the lifetime of your application until the
[04:59.260 --> 05:07.140]  application is terminated. So this is where Androsia kicks in. It will clean off all the
[05:07.140 --> 05:13.660]  unused but reachable objects in your memory. So to summarize what I've been talking about
[05:13.660 --> 05:19.000]  until now, reachable and unused StringBuilder objects may contain sensitive information.
[05:19.000 --> 05:22.880]  So throughout the course of this presentation, I'm going to specifically talk about StringBuilder
[05:22.880 --> 05:28.040]  objects as an example. But this framework is, you know, extendable to any other kind
[05:28.040 --> 05:34.100]  of objects as well. So reachable and unused StringBuilder objects may contain sensitive
[05:34.100 --> 05:39.200]  information and a heap dump or an application compromise will help an attacker to reveal
[05:39.200 --> 05:43.880]  that sensitive information easily. So don't just rely on the garbage collector to collect
[05:43.880 --> 05:48.520]  all that information, but instead destroy by overwriting all the critical data that
[05:48.520 --> 05:57.080]  you have in the application. So with that in mind, we do have some solutions provided
[05:57.080 --> 06:03.480]  by Java libraries which allow you to destroy objects that are created on the heap. One
[06:03.480 --> 06:09.980]  of them is the KeyStore.PasswordProtection class. And this class provides a destroy method
[06:09.980 --> 06:17.420]  which can be used to destroy the object's content. But again, you know,
[06:17.940 --> 06:22.720]  it's up to the developer to use this API. He may use it at a very late stage in
[06:22.720 --> 06:28.220]  the program, or he may not even use it at all. So, ideally, we want to automate this
[06:28.220 --> 06:32.540]  process. We don't want to rely on the developer to destroy our data. Our data should be in
[06:32.540 --> 06:38.140]  our hands, not the developer's hands. So, you know, this is where Androsia will again
[06:38.140 --> 06:44.020]  automate that process and instrument code to remove or destroy any object that has critical
[06:44.020 --> 06:56.770]  information. So how does Androsia help? So Androsia uses static code analysis, and it
[06:56.770 --> 07:03.530]  determines the last use of objects at a whole program level. If I had to be specific, it
[07:03.530 --> 07:08.730]  uses a summary-based interprocedural data flow analysis. So there are three terms here,
[07:08.730 --> 07:13.390]  three key terms. One is summary-based, second is interprocedural, and the third one is data
[07:13.390 --> 07:20.830]  flow analysis. So the power of static code analysis is not just grabbing data and giving you information
[07:20.830 --> 07:25.410]  about what the Android application's permissions are or whatever. It comes from data
[07:25.410 --> 07:32.410]  flow analysis. So every statement in your program has some data to contribute. And that
[07:32.410 --> 07:37.870]  data needs to be propagated across all the statements in your application, and every
[07:37.870 --> 07:43.190]  statement could probably add or remove some data from the data flow sets that are going
[07:43.190 --> 07:50.630]  on. So that's how powerful data flow analysis is. It will help you perform any sort of data
[07:50.630 --> 08:01.170]  related analysis, which may include even taint analysis. So that's where the power of static
[08:01.170 --> 08:08.710]  code analysis is better used. What I mean by summary-based interprocedural analysis
[08:08.710 --> 08:13.950]  is that this is a whole program analysis. This is not just analyzing a specific method.
[08:13.970 --> 08:19.390]  And every method, when it is analyzed, the analysis results can be maintained or cached
[08:19.390 --> 08:26.110]  in a summary. The reason for that is that a particular method, let's say foo, can be
[08:26.110 --> 08:30.770]  called from multiple locations. And if you have already computed the summary for foo,
[08:30.770 --> 08:36.090]  you may want to reuse that computed summary instead of computing it again and again whenever
[08:36.090 --> 08:42.150]  the method is called. So that's why we use summary as a cache for storing our analysis
[08:42.150 --> 08:47.590]  results for a specific method. And after we have determined the last usage point of a
[08:47.590 --> 08:54.790]  particular object, we are going to instrument byte code to clear its memory content. So
[08:54.790 --> 09:00.510]  to give you a very simple example, I have statements numbered 1 to 10 here, and there
[09:00.510 --> 09:06.230]  is a definition of variable x at statement number 3. And then there is a use of variable
[09:06.230 --> 09:12.950]  x at statement number 5. So between statement number 3 and 5, since the variable is defined
[09:12.950 --> 09:19.930]  and used, I can say that the variable is live. But before statement number 3, the variable
[09:19.930 --> 09:26.430]  x wasn't even defined and neither was it used. So it is said to be dead between 1 and 3.
[09:26.430 --> 09:35.610]  And now at 7, the variable x gets redefined and then reused at 8. So between 7 and 8,
[09:35.610 --> 09:41.270]  the variable is again live. However, between 5 and 7, the variable won't be live. It has
[09:41.270 --> 09:47.510]  been defined, but it is not being used. So this is just to give you an idea of what
[09:47.510 --> 09:52.770]  liveness of a variable looks like. And we are going to use it further ahead when we
[09:52.770 --> 09:58.270]  are seeing how Androsia really works and how it computes the liveness sets and how
[09:58.270 --> 10:08.100]  it then infers the last usage point. So basically, if you have a span of definition and usage,
[10:08.100 --> 10:11.040]  between that span, a variable is going to be live.
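The def/use picture above can be reproduced with a few lines of code. This is a minimal sketch for a single variable in straight-line code, using the layout from the slide (defined at 3, used at 5, redefined at 7, used at 8); the class and method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: liveness of one variable x in straight-line code.
final class LivenessSpan {

    // Returns the 1-based statement numbers after which x is still live,
    // meaning x will be read again before it is redefined.
    static List<Integer> liveAfter(boolean[] defs, boolean[] uses) {
        int n = defs.length - 1;              // index 0 is unused
        boolean[] liveOut = new boolean[n + 1];
        boolean liveInOfNext = false;         // nothing is live after the last statement
        for (int s = n; s >= 1; s--) {        // walk backwards through the program
            liveOut[s] = liveInOfNext;
            // live on entry if used here, or live afterwards and not redefined here
            liveInOfNext = uses[s] || (liveOut[s] && !defs[s]);
        }
        List<Integer> result = new ArrayList<>();
        for (int s = 1; s <= n; s++) {
            if (liveOut[s]) result.add(s);
        }
        return result;
    }

    // The layout from the slide: x defined at 3, used at 5, redefined at 7, used at 8.
    static List<Integer> slideExample() {
        boolean[] defs = new boolean[11];
        boolean[] uses = new boolean[11];
        defs[3] = true; uses[5] = true;
        defs[7] = true; uses[8] = true;
        return liveAfter(defs, uses);
    }
}
```

For the slide's layout, `slideExample()` reports x as live after statements 3, 4 and 7, which matches the two spans 3 to 5 and 7 to 8, with a dead gap between 5 and 7.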
[10:13.540 --> 10:18.640]  So this is a snapshot of the heap dump taken using Eclipse Memory Analyzer Toolkit. So
[10:18.640 --> 10:24.740]  I had this sample code which contained a static field which is called static secret
[10:24.740 --> 10:30.380]  and it contained a password. And it shows up in the heap dump. And after the optimization
[10:30.380 --> 10:36.300]  or after the instrumentation, the static secret does not contain any password at all because
[10:36.300 --> 10:40.220]  Androsia has successfully removed the password from the heap memory.
[10:40.220 --> 10:48.420]  So, to give you an overview of where Androsia would fit in: a user could
[10:48.420 --> 10:55.620]  provide an application or source code to the Androsia server. And the Androsia server can then unpack
[10:55.620 --> 11:03.500]  the application to Dalvik byte code. And the Dalvik byte code gets converted into
[11:03.500 --> 11:10.160]  Jimple code. Now, Jimple is an intermediate representation. Just like Smali, it is an
[11:10.160 --> 11:15.500]  intermediate representation which takes the good from both worlds. It takes the good from the
[11:15.500 --> 11:21.400]  Dalvik byte code as well as the good from the Java high-level language. So Jimple
[11:21.400 --> 11:27.320]  is basically a mixture of Java and it is simple so they call it Jimple. So Jimple has only
[11:27.320 --> 11:32.200]  15 different kinds of statements. Java has a lot of complicated structures
[11:32.200 --> 11:37.380]  and different kinds of statements, and even Dalvik byte code has like 200 different opcodes.
[11:37.620 --> 11:44.400]  So Jimple has only a precise number of statements which is 15 and, you know, it is easier to
[11:44.400 --> 11:48.960]  analyze when you have just 15 odd statements to play around with. And this Jimple code
[11:48.960 --> 11:56.380]  is then analyzed by the analysis that we have embedded into Androsia and then the code instrumentation
[11:56.380 --> 12:01.280]  happens and then we convert the Jimple code back to Dalvik byte code and then the Dalvik
[12:01.280 --> 12:07.260]  byte code is packaged again and signed or we could just provide the analysis results
[12:07.260 --> 12:17.960]  to the user. So this is the entire flow of how will you eventually use Androsia. So an
[12:17.960 --> 12:23.080]  important fact that I should also mention here is that the framework which is called
[12:23.080 --> 12:28.840]  Soot, which is the backbone of Androsia, can be used to perform any sort of data flow analysis.
[12:28.840 --> 12:35.120]  It's not just the analysis that I have done. So I'm going to talk about Soot next and it
[12:35.120 --> 12:40.880]  can be used, you know, you can build your own data flow analysis over Soot. So Soot
[12:40.880 --> 12:47.100]  is basically a framework for Java byte code analysis and it can be used
[12:47.100 --> 12:52.660]  to implement your own analysis as well. It provides this three address code representation
[12:52.660 --> 12:58.500]  called Jimple and it looks something like this. So for example, I have an if statement
[12:58.500 --> 13:05.520]  here and in any line of Jimple code, you will not see more than three operands in one line.
[13:05.520 --> 13:12.740]  So you have R1, null and label zero here. So it is very simple to view, you know. It's
[13:12.740 --> 13:20.800]  not like Java that you can have a lot of complicated structures in one statement itself. And this
[13:20.800 --> 13:26.800]  diagram shows you that the Soot framework itself takes in the Java source code or the
[13:26.800 --> 13:32.340]  class files and converts it into the Jimple three address intermediate representation. And
[13:32.340 --> 13:36.960]  then you can perform your analysis optimization and then again convert it back to the class
[13:36.960 --> 13:42.680]  files. So until recently, actually, you know, as I talked about that we are going to do
[13:42.680 --> 13:47.920]  an inter-procedural analysis, so Soot also provides you the ability to perform an intra-procedural
[13:47.920 --> 13:56.480]  analysis. So until now, until recently, Soot was missing a Dalvik to Jimple transformation
[13:56.480 --> 14:06.020]  module. And now that void has been filled. And that void has been filled by a plugin
[14:06.020 --> 14:13.100]  called Dexpler which allows you to transform Dalvik to Jimple code. So it makes it easier
[14:13.100 --> 14:18.680]  for you to analyse any Android application. Okay. So there's one more tool that I want
[14:18.680 --> 14:23.940]  to, you know, talk about here which is called FlowDroid. Now, FlowDroid allows you to generate
[14:24.020 --> 14:29.900]  a dummy main method because unlike Java, Android applications do not
[14:29.900 --> 14:35.520]  have a main method. And to start your data flow analysis, you need a starting point.
[14:35.600 --> 14:44.110]  And this dummy main method acts like a starting point for your data flow analysis. The dummy
[14:44.110 --> 14:51.250]  main method actually connects all the Android life cycle callbacks. So it gives you, you
[15:04.630 --> 15:16.630]  know, a dummy main method. So that way, you can start your data flow analysis from a single
[15:16.630 --> 15:25.650]  point. Okay. So objects can, you know, exist in various scopes. They can either exist as
[15:25.870 --> 15:33.170]  a local variable inside a method. For example, I have X, Y, Z variables here which contain eventually
[15:33.170 --> 15:38.910]  the StringBuilder objects. And their scope is limited to this method foo only. They're
[15:38.910 --> 15:46.010]  not used outside this method anywhere. So this is one simple example which I'm going
[15:46.010 --> 15:50.710]  to take or, you know, it will be a walk-through example throughout the course of this presentation.
[15:50.830 --> 15:58.390]  So this example will be recurring. And I have a variable X here which I'm instantiating
[15:58.390 --> 16:03.210]  with a StringBuilder object. So basically X will be referring to the StringBuilder
[16:03.210 --> 16:07.490]  which contains the value secret. Y will be referring to the StringBuilder which contains
[16:07.490 --> 16:12.210]  the value password. And there's some logic here which I'm going to, you know, I'll just
[16:12.210 --> 16:18.290]  add it for the sake of explaining things. So the scope could be either as a local variable
[16:18.290 --> 16:25.890]  or your objects could even exist as static fields of a class. And this is completely
[16:25.890 --> 16:30.450]  different because, you know, the scope of a static field is the entire program. It's
[16:30.450 --> 16:39.070]  not just a specific method. So X, the value X, can be invoked from anywhere else using
[16:39.070 --> 16:48.450]  the class name which is my class. So this is how you will invoke variable, sorry, this
[16:48.450 --> 16:57.000]  is how you will invoke variable X. And this invocation can happen in any other method
[16:57.000 --> 17:04.720]  which is bar in my case. So the scope, you know, becomes the entire program. Or you can
[17:04.720 --> 17:10.920]  have objects as instance field. Here you can see that I have a private access modifier
[17:10.920 --> 17:17.640]  specified for instance field X inside the my instance field string builder class. So
[17:18.340 --> 17:25.780]  the scope of this variable X will now be limited to the scope of the object of this class,
[17:25.780 --> 17:36.830]  my instance field SB. Okay. So coming to the demo, I have three classes here. So I have
[17:36.850 --> 17:41.770]  three classes here. And one of them is the main activity where my control will begin.
[17:41.950 --> 17:48.290]  And this is where the secret has been defined. And then I'm calling the useStaticField method.
[17:49.150 --> 17:55.770]  And the useStaticField method uses the static secret and then calls bar method. And inside
[17:55.770 --> 18:02.310]  the bar method, I'm again using the static secret. So if I had to, you know, just ask
[18:02.310 --> 18:07.270]  you guys to point out the last usage of the static secret, it would be like inside the
[18:07.270 --> 18:12.770]  bar method, right? Here. So the instrumentation should happen immediately after the System.out.println
[18:12.770 --> 18:19.710]  statement of static secret. That's correct. But, you know, there can be loops as well.
[18:20.370 --> 18:24.590]  Like this statement could be inside a for loop. And then the instrumentation point will
[18:24.590 --> 18:28.610]  not be immediately after static secret. It will be right before the return statement
[18:28.610 --> 18:34.910]  of bar method. If there's a loop outside the call to the bar method, then your instrumentation
[18:34.910 --> 18:41.190]  will happen here. Why is that? Because if you instrument it within the loop, then your
[18:41.190 --> 18:48.670]  logic of the application will break. And you don't want that to happen. So we'll see an
[18:48.670 --> 18:55.230]  example, actually a demo on it, which is on the next slide. So this is the same code which
[18:55.230 --> 19:01.810]  I showed you on the last slide. I have a password StringBuilder object being created
[19:01.810 --> 19:06.390]  which contains the value password. And then I'm calling the useStaticField method here.
[19:07.270 --> 19:13.250]  Now we'll see the definition of the useStaticField. The useStaticField is using the static secret
[19:13.250 --> 19:19.270]  and calling bar method. And inside the bar method, we again use the static secret. And
[19:19.270 --> 19:26.130]  then we have two print statements which are printing hello and bye. So if you see, the
[19:26.130 --> 19:32.450]  instrumentation point should be right now just after the last use of static secret,
[19:32.450 --> 19:43.690]  which is on line 20. So let's run the tool on this. So this is what the dummy main method
[19:43.690 --> 19:52.570]  looks like. You'll see it in the stack trace that gets printed out. And right now, the
[19:52.570 --> 19:56.750]  output format is set to Jimple so that we can see where the instrumentation is happening.
[19:56.750 --> 20:03.510]  You can even set it to Dalvik bytecode so you'll get a dex format as an output. Alright.
[20:03.510 --> 20:07.650]  We're in the onCreate method, but we need to go to the bar method, right? So inside
[20:07.650 --> 20:22.170]  the bar method... so can we pause the video, please? So you'll see that right here, you
[20:22.170 --> 20:28.170]  have the println statement for static secret. This is the static secret reference variable,
[20:28.170 --> 20:38.670]  which is in R2. And you're printing the value of R2. And from line number 30 to 34 is where
[20:38.670 --> 20:44.090]  the code from Androsia has been instrumented. So what I'm doing here is that I'm getting
[20:44.290 --> 20:49.550]  a reference to the static secret field. I'm calculating the length of the static secret
[20:49.550 --> 20:55.870]  field. And I'm using the delete API from zero to the length of the static secret to
[20:55.870 --> 21:02.870]  remove or clear its memory content. So the instrumentation point right now is right after
[21:02.870 --> 21:06.770]  the println statement for static secret, which is the correct instrumentation point.
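In source form, the injected statements and the placement rule (directly after the last use, but outside any loop that still reads the value) could be pictured as follows. This is an illustrative reconstruction in plain Java, not the demo's actual source or Androsia's output; `InstrumentedDemo`, `scrub`, and both `bar` variants are assumed names.

```java
// Illustrative reconstruction; class and method names are assumptions, not the demo's source.
final class InstrumentedDemo {

    static StringBuilder staticSecret = new StringBuilder("password");

    // Roughly what the injected statements amount to: read the field,
    // take its length, and delete the range [0, length).
    static void scrub() {
        StringBuilder ref = staticSecret;
        int len = ref.length();
        ref.delete(0, len);
    }

    // No loop: the scrub goes directly after the last use.
    static int barStraightLine() {
        int seen = staticSecret.length();   // last use of the secret
        scrub();
        return seen;
    }

    // Loop: scrubbing inside the body would hand later iterations an
    // empty value, so the scrub is placed after the loop instead.
    static int barWithLoop(int iterations) {
        int seen = 0;
        for (int i = 0; i < iterations; i++) {
            seen += staticSecret.length();  // used on every iteration
        }
        scrub();
        return seen;
    }
}
```

Note that `delete` changes the builder's logical content; whether the old characters are also overwritten in the backing array depends on the runtime's `StringBuilder` implementation, so this sketch only demonstrates where the call is placed.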
[21:06.870 --> 21:16.190]  So can we resume the video, please? Alright. Now what we're going to do is that we're going
[21:16.190 --> 21:21.930]  to uncomment the for loop here. And now the
[21:21.930 --> 21:27.910]  instrumentation point will not be right after the static
[21:27.910 --> 21:34.570]  secret statement, but it will be right before the bye statement here. The System.out.println
[21:34.570 --> 21:40.810]  for the bye statement. So yeah. That's because you don't want to instrument
[21:40.810 --> 21:45.650]  within the loop, because in the next iteration of the loop, your reset value will be used,
[21:45.650 --> 21:51.550]  which will break the logic of the code. Yeah. So this time the instrumentation, which is
[21:51.550 --> 22:00.170]  from 31 to 34, 33, is happening right before the print statement for bye, which is the right
[22:00.170 --> 22:05.790]  instrumentation point. And Androsia is smart enough to figure out if there is a loop, it
[22:05.790 --> 22:11.430]  can hoist the instrumentation point outside the loop. So that's the main idea that I want
[22:11.430 --> 22:17.310]  to convey here. And that goes on for any level of nested for loops. You know, you can
[22:17.310 --> 22:21.670]  have the bar method being called from a for loop like this, and then the instrumentation
[22:21.670 --> 22:27.350]  point will change to somewhere just before the return of the useStaticField
[22:27.350 --> 22:32.850]  method. So the return statement would be somewhere here in the Jimple code, and
[22:32.850 --> 22:38.270]  the instrumentation point will be right before the return statement. This is the useStaticField
[22:38.270 --> 22:43.290]  method, and right before the return statement of useStaticField, you'll find the instrumented
[22:43.290 --> 22:51.210]  code from line number 73 to 74, 75, actually. Even 77. So this is the part which gets
[22:51.210 --> 22:56.650]  instrumented, just before the return statement. So you see that the instrumentation points
[22:56.650 --> 23:02.590]  are changing according to the loops that you have inside the code, which is, you know,
[23:03.150 --> 23:08.030]  which may not be because of nested loops, it may be even because of recursion of a
[23:08.030 --> 23:13.450]  specific method. So Androsia will be smartly identifying if there is recursion or loops,
[23:13.450 --> 23:20.210]  and changing the instrumentation points accordingly. So we've seen how Androsia is doing or performing
[23:20.210 --> 23:26.550]  the instrumentation, but we really don't know what is the logic behind that, what is the
[23:26.550 --> 23:33.710]  algorithm that is running behind Androsia. So to come to that, first let's identify what
[23:33.710 --> 23:39.130]  information a single statement can provide us. So every statement has
[23:39.130 --> 23:44.030]  some data to add, right? And the data that we are concerned about here
[23:44.030 --> 23:55.280]  is called liveness data. So the next few slides are going to talk about the live variable
[23:55.280 --> 24:03.060]  analysis definition, and how we compute the summary for every method, which is going to
[24:03.060 --> 24:08.820]  look something like this. It will be a two-tuple. Basically, the first element will
[24:08.820 --> 24:16.180]  be the variable. So the summary of foo method will tell you that the variable x was last
[24:16.180 --> 24:21.280]  used in the statement if y.length is less than x.length. It will be something like this.
[24:21.400 --> 24:27.240]  So for every variable, you will have, you know, a two-tuple element, and the summary
[24:27.240 --> 24:32.500]  will tell you how many variables are there, and what is the last usage point of every
[24:32.500 --> 24:39.260]  variable inside it. So to compute the summary, we have a two-step process. One is to
[24:39.260 --> 24:46.280]  compute the definition and use sets for every statement. Then using the definition
[24:46.900 --> 24:52.640]  and use sets, we are going to compute live variable sets for every statement, LVEntry
[24:52.640 --> 24:59.820]  and LVExit. And using the live variable entry and exit sets, we're going to infer the last
[24:59.820 --> 25:06.120]  usage point for a local or static field reference within that method. And we're going to store
[25:06.120 --> 25:13.850]  that as a summary. And now, once we have the summaries for every method that is analyzed
[25:13.850 --> 25:20.390]  by Androsia, we're going to compute the last usage point of a static field reference, considering
[25:20.390 --> 25:25.170]  the whole program, not just within a method. Because we know that the summary will tell
[25:25.170 --> 25:29.590]  us that within a method, this is the last usage point of a particular local variable.
[25:29.590 --> 25:34.490]  But it will not tell us across the entire program where is the last usage actually happening.
[25:34.490 --> 25:41.030]  So the summaries need to be combined and propagated across your program to infer that knowledge.
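The "summary as a cache" idea can be sketched like this. It assumes a summary is simply a map from variable name to the statement number of its last use, and that the per-method analysis is passed in as a function; `SummaryCache` and its methods are hypothetical names, not Androsia's or Soot's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of a per-method summary cache; all names are hypothetical.
final class SummaryCache {

    // A summary maps each variable to the statement number of its last use.
    private final Map<String, Map<String, Integer>> summaries = new HashMap<>();
    private int analysesRun = 0;

    // Returns the cached summary for a method, running the analysis only on first request.
    Map<String, Integer> summaryFor(String method,
                                    Function<String, Map<String, Integer>> analysis) {
        return summaries.computeIfAbsent(method, m -> {
            analysesRun++;
            return analysis.apply(m);
        });
    }

    // Number of times the underlying analysis actually ran.
    int analysesRun() {
        return analysesRun;
    }
}
```

If foo is called from several places, every call site after the first gets the stored result back instead of triggering another pass over foo's body. A real implementation would also have to handle recursive methods, which this sketch does not attempt.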
[25:44.400 --> 25:50.180]  So to start off with the definition of live variable analysis, live variable analysis
[25:50.180 --> 25:55.820]  determines for each statement, which variables must have a subsequent use prior to the next
[25:55.820 --> 26:00.180]  definition, which is what we learned from the diagram we saw earlier, right? We had
[26:00.180 --> 26:05.500]  a definition and use of variable X. And between that span, the variable was live. And after
[26:05.500 --> 26:10.600]  that, it was dead. So this is exactly what the statement is telling us. So in this particular
[26:10.600 --> 26:18.340]  code, the last usage point of variable X is this blue statement, which is number 4. And
[26:18.340 --> 26:28.240]  the last usage point of Y is statement number 5 and 6. Because they are two different statements
[26:28.240 --> 26:32.760]  within the if clause. So either this one could execute or this one could execute based on
[26:32.760 --> 26:39.540]  the result from the if clause. And the last use of Z is statement number 7. So this is
[26:39.540 --> 26:46.020]  very intuitive, right? We can see it. But how do you determine this using an automated
[26:46.020 --> 26:52.500]  way? That's the question. And it's not as easy as it seems like. So we're going to run
[26:52.500 --> 26:57.400]  through this same example in the next few slides and figure out how we determine the
[26:57.400 --> 27:03.540]  last usage points of X, Y, and Z. So the last usage point of a variable is also the
[27:03.540 --> 27:10.300]  last statement where that variable was live. So X is last live here. And it is also the
[27:10.300 --> 27:15.840]  last usage point of that variable X. So we're going to use this fact to determine the last
[27:15.840 --> 27:24.260]  usage point of every variable. So talking about the def use sets, the definition of
[27:24.300 --> 27:30.340]  a definition set is that it will contain all the variables defined in a particular statement.
[27:30.460 --> 27:37.420]  So in statement number 1, X is being defined. So for statement 1, you'll have an entry X.
[27:37.980 --> 27:44.720]  So the use set is the set of variables that are used in a statement. So if we run through
[27:44.720 --> 27:49.900]  this example and compute the def use sets, they will look something like this. So we
[27:49.900 --> 27:56.360]  start from the bottom. Z is being used here. So it will go in the use column for statement
[27:56.360 --> 28:02.320]  number 7. And X is being defined here. So it will go in the definition column for statement
[28:02.320 --> 28:09.400]  number 7. And similarly, if we keep going, it will eventually populate the entire table.
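A def/use table of this kind can be derived mechanically once each statement is reduced to "what it assigns" and "what it reads". The sketch below does that for a seven-statement program. The slide's code is not in the transcript, so the statements here are an assumed stand-in chosen to agree with what is said about statements 1, 4, 5, 6 and 7; `DefUse` and `Stmt` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative sketch; the seven statements are an assumed stand-in for the slide's code.
final class DefUse {

    // A three-address-style statement: an optional assigned variable plus the variables it reads.
    static final class Stmt {
        final String target;          // null when nothing is assigned (for example an if)
        final Set<String> operands;

        Stmt(String target, String... operands) {
            this.target = target;
            this.operands = new LinkedHashSet<>(Arrays.asList(operands));
        }

        // Variables defined by this statement.
        Set<String> def() {
            return target == null
                    ? Collections.<String>emptySet()
                    : Collections.singleton(target);
        }

        // Variables read by this statement.
        Set<String> use() {
            return operands;
        }
    }

    // Index 0 holds statement 1, index 1 holds statement 2, and so on.
    static Stmt[] example() {
        return new Stmt[] {
            new Stmt("x"),            // 1: x = new StringBuilder("secret")
            new Stmt("y"),            // 2: y = new StringBuilder("password")
            new Stmt("z"),            // 3: z = new StringBuilder()      (assumed)
            new Stmt(null, "y", "x"), // 4: if (y.length() < x.length())
            new Stmt("z", "y"),       // 5: z = y.append(...)            (assumed branch body)
            new Stmt("z", "y"),       // 6: z = y.reverse()              (assumed branch body)
            new Stmt("x", "z"),       // 7: x = z.append(...)            (assumed)
        };
    }
}
```

Statement 7, for example, ends up with x in its definition column and z in its use column, as described above.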
[28:10.420 --> 28:18.520]  So the direction of propagating that data or propagating the data flow facts is from
[28:18.520 --> 28:24.900]  bottom to top. It is the reverse order of execution of the program. So if I compute
[28:24.900 --> 28:32.620]  the LVExit of the 6th statement, which is the last statement in my code, then
[28:32.620 --> 28:38.920]  it is going to be phi, the empty set. Because there is no other variable which is being used after statement
[28:38.920 --> 28:44.920]  number 6. So this is a must. Like this has to be phi because there cannot be any other
[28:44.920 --> 28:50.440]  variable which is used after the last statement in the code. So this will be an initial point
[28:50.440 --> 28:56.620]  for the algorithm to run. And once you compute LVExit of 6, you are going to use the def
[28:56.620 --> 29:04.560]  use set of 6 to compute LVEntry of 6. And then you will compute LVExit of 5, then you
[29:04.560 --> 29:11.400]  will compute LVEntry of 5, then LVExit of 4, LVEntry of 4. Similarly, once you reach
[29:11.400 --> 29:18.660]  a point where there are two branches, you are going to merge the results from LVEntry
[29:18.660 --> 29:29.640]  3 and LVEntry 4 using a union operator. And that result will go into LVExit of 2. And
[29:29.640 --> 29:36.500]  using LVExit of 2, we are going to compute LVEntry of 2 and so on. So that's up next.
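Written as equations, the backward propagation just described has the usual live-variable form. The notation below (succ, def, use, and the subscripted LV sets) is a conventional rendering and an assumption about what the slide shows, with the empty set standing for what the talk calls phi:

```latex
\[
LV_{exit}(\ell) =
\begin{cases}
\emptyset & \text{if } \ell \text{ is the last statement} \\
\bigcup_{\ell' \in succ(\ell)} LV_{entry}(\ell') & \text{otherwise}
\end{cases}
\qquad
LV_{entry}(\ell) = \bigl(LV_{exit}(\ell) \setminus def(\ell)\bigr) \cup use(\ell)
\]
```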
[29:36.500 --> 29:41.320]  We are going to use the same example, the one we discussed for the def use set and we
[29:41.320 --> 29:45.600]  are going to compute these things. So if I had to give you a mathematical representation,
[29:45.600 --> 29:50.980]  I mean, there's nothing possible without math here because data flow analysis needs to have
[29:50.980 --> 29:56.560]  math backing it up for proof of correctness. And without any algorithm or without some
[29:56.560 --> 30:03.280]  math, there's no way you can perform a data flow analysis. So the LVExit of a statement
[30:03.280 --> 30:09.600]  L will be phi if L is the last statement of the body, which is what we discussed on the
[30:09.600 --> 30:20.460]  last slide. Otherwise, it will be the union over the LVEntry sets of the successors of L.
[30:20.460 --> 30:26.720]  So if you have a merge point, we'll have to take these LVEntry sets and do a union over
[30:26.720 --> 30:33.840]  all the entry sets to compute LVExit of L. Now once we have the LVExit of L, we are going
[30:33.840 --> 30:40.920]  to plug that value here, and we already computed the definition and use sets for a statement L.
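Written out, the two equations being described take the usual live-variable form. The notation below (succ for the successors of a statement, def and use for its definition and use sets) is an assumption, since the slide itself is not part of the transcript.

```latex
\[
\mathrm{LV}_{\mathrm{exit}}(\ell) =
\begin{cases}
\emptyset & \text{if } \ell \text{ is the last statement of the body} \\
\bigcup_{\ell' \in \mathrm{succ}(\ell)} \mathrm{LV}_{\mathrm{entry}}(\ell') & \text{otherwise}
\end{cases}
\]
\[
\mathrm{LV}_{\mathrm{entry}}(\ell) =
\bigl(\mathrm{LV}_{\mathrm{exit}}(\ell) \setminus \mathrm{def}(\ell)\bigr) \cup \mathrm{use}(\ell)
\]
```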
[30:41.660 --> 30:49.320]  And that's going to give us LVEntry of L. Let's see this in action. So we have the def
[30:49.320 --> 30:57.100]  use columns populated. We know that the LVExit of statement 7 will be phi because this is the
[30:57.100 --> 31:03.820]  last statement. And now we are going to use this formula to compute LVEntry of 7. So LVEntry
[31:03.820 --> 31:17.720]  of 7 will be phi minus x, union z, which will give you z. And then LVEntry of 7 will form
[31:17.720 --> 31:27.620]  the LVExit of 6. So LVExit of 6 will be z. And now z minus z, union y, will give you y.
[31:27.620 --> 31:33.800]  So you get the LVEntry of statement number 6 as well. And so on, you can populate this
[31:33.800 --> 31:42.700]  table iteratively and you get the entire table. Now, the interesting part to note
[31:42.700 --> 31:50.440]  here is that if a variable disappears from the entry set to the exit set in the live
[31:50.440 --> 31:56.920]  variable table, like in the fourth statement, we have x disappearing from the entry set
[31:56.920 --> 32:03.560]  to the exit set. In the fifth and sixth statement, y disappears from entry set to the exit set.
[32:03.560 --> 32:09.680]  And in the seventh statement, we see that z disappears from the entry set to the exit
[32:09.680 --> 32:15.540]  set. So the corresponding statement numbers will be the last usage points of those variables.
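Reading the last usage points off the table can be sketched as a set difference per statement. This is a sketch under assumptions: the function name `last_use_points` is hypothetical, and the table fragment below is only partly taken from the talk. The rows for statements 6 and 7 match the values worked out above; the rows for statements 4 and 5 are made up to be consistent with the disappearances the talk describes.

```python
def last_use_points(entry, exit_):
    """Map each variable to the statements where it is live on entry but not on exit."""
    points = {}
    for label in sorted(entry):
        for var in sorted(entry[label] - exit_[label]):
            points.setdefault(var, []).append(label)
    return points


# Fragment of a live-variable table (rows 4 and 5 are assumed values).
ENTRY = {4: {"x", "y"}, 5: {"y"}, 6: {"y"}, 7: {"z"}}
EXIT = {4: {"y"}, 5: {"z"}, 6: {"z"}, 7: set()}

print(last_use_points(ENTRY, EXIT))
```

A variable can have more than one last usage point when the code branches, which is why y shows up for two statements in this fragment.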
[32:16.060 --> 32:23.380]  And that's what we were after. We've got that now. So now we're going to store all these
[32:23.380 --> 32:30.240]  results, the last usage point results, as the summary of foo method. And we're going
[32:30.240 --> 32:35.280]  to use this summary in other parts of the program when the foo method gets invoked from
[32:35.280 --> 32:41.520]  there. We don't want to compute this again and again and again. So let's just say that
[32:41.520 --> 32:47.740]  there was a method foo which calls bar, and bar calls baz. Okay? So
[32:48.440 --> 32:52.920]  let's just assume that we've already computed summaries for each
[32:52.920 --> 32:58.640]  of these individual methods: foo, bar, as well as baz. And the
[32:58.640 --> 33:04.700]  summary of baz is going to tell me that the static field reference was last used at C4. The
[33:04.700 --> 33:10.320]  summary for bar is telling me that the static field reference was last used at B5. And there
[33:10.320 --> 33:17.460]  was no use of SFR in foo. Now what I'm going to do is, I'm going to propagate the summaries
[33:17.460 --> 33:25.600]  across these methods in a reverse topological order. So, this is how the analysis will run.
[33:25.600 --> 33:31.480]  And when you read statement C4, you see that SFR is being used. What I'm going to do is
[33:31.480 --> 33:39.500]  create a data structure with the value baz, SFR, C4, which is going to tell me that the
[33:39.500 --> 33:48.680]  last use of SFR is happening at the C4 statement in the baz method. Okay? And once I've
[33:48.680 --> 33:54.220]  completed the analysis for the baz method, I'm going to analyze the bar
[33:54.220 --> 34:00.660]  method. So, the analysis happens in a reverse topological order. So, if baz is the last
[34:00.660 --> 34:05.080]  method that is called, it will be analyzed first. If bar is the second last method that
[34:05.080 --> 34:09.640]  is called,
[34:09.640 --> 34:15.360]  it will be analyzed second. So basically, foo is calling
[34:15.360 --> 34:19.480]  bar, bar is calling baz, baz will be analyzed first, bar will be analyzed second, and then
[34:19.480 --> 34:27.100]  you'll analyze foo. So, here, we see that SFR is being used at B5. So, the LV entry
[34:27.100 --> 34:34.120]  will contain B5 SFR. And once we reach the call statement for baz, we're going to pull
[34:34.120 --> 34:40.760]  in the summary of baz method from here. And now we're going to see whether the
[34:40.760 --> 34:46.740]  LV exit set of B3 already contains a use of SFR, and it does in our case. So, we're going
[34:46.740 --> 34:53.280]  to replace the value of this little red data structure with the new entry, which will
[34:53.280 --> 34:59.760]  be bar SFR B5, which would mean that SFR is last used at B5 statement inside bar method.
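The propagation across foo, bar, and baz can be sketched as follows. This is a simplified sketch, not Androsia's code: each method body is modeled as a straight-line list of (label, variables used, callee) tuples, the statement counts are assumptions (the talk only names A3, B3, B5, and C4), recursion is assumed absent, and `propagate_last_use` is a hypothetical name. The callee-first order comes from Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# (label, variables used, callee or None); only A3, B3, B5, C4 come from the talk.
METHODS = {
    "foo": [("A1", set(), None), ("A2", set(), None),
            ("A3", set(), "bar"), ("A4", set(), None)],
    "bar": [("B1", set(), None), ("B2", set(), None),
            ("B3", set(), "baz"), ("B4", set(), None),
            ("B5", {"SFR"}, None)],
    "baz": [("C1", set(), None), ("C2", set(), None),
            ("C3", set(), None), ("C4", {"SFR"}, None)],
}


def propagate_last_use(methods, var):
    """Return (analysis order, {method: (method, var, label) or None})."""
    # Map each caller to its callees so that callees are ordered first.
    calls = {name: {callee for _, _, callee in body if callee}
             for name, body in methods.items()}
    order = list(TopologicalSorter(calls).static_order())
    summaries = {}
    for name in order:
        if name not in methods:  # callee without a body, e.g. a library method
            continue
        last = None
        for label, uses, callee in reversed(methods[name]):  # walk bottom to top
            if callee is not None and summaries.get(callee) is not None:
                last = summaries[callee]  # nothing later in the caller: keep callee's record
                break
            if var in uses:
                last = (name, var, label)  # a later use in the caller wins
                break
        summaries[name] = last
    return order, summaries


order, summaries = propagate_last_use(METHODS, "SFR")
print(order)
print(summaries["foo"])
```

Walking each body from the bottom means the first use or summarized call that is found is the latest one in execution order, which mirrors the check against the exit set at the call statement.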
[35:00.060 --> 35:06.460]  So, this is how you'll keep on going and updating the red data structure. And once the analysis
[35:06.460 --> 35:12.220]  is complete over all these methods, you'll realize that the red data structure will contain
[35:12.220 --> 35:19.440]  the last usage point of the SFR variable across the entire program. So, similarly,
[35:19.440 --> 35:24.000]  if you go ahead and see that there's a bar method invocation here, you're going to pull
[35:24.000 --> 35:31.360]  in the summary for bar method, but the exit set of A3 is phi, because SFR is not being
[35:31.360 --> 35:37.900]  used ahead in the program inside the foo method. So, this particular
[35:37.900 --> 35:43.040]  value will not be overwritten by anything, because it's just phi that is coming from
[35:43.040 --> 35:48.100]  the exit set of A3. We just completed our analysis, and the results tell us that SFR
[35:48.100 --> 35:55.100]  was last used at the B5 statement inside the bar method. So, this is how, you know,
[35:55.100 --> 36:00.260]  algorithmically, Androsia will run on your code and figure out what is the last usage
[36:00.260 --> 36:08.500]  point of any object in that application. Just to summarize, we went through this definition
[36:08.500 --> 36:15.700]  of live variable analysis. We computed summaries for individual methods using a two-step process.
[36:15.700 --> 36:20.580]  One of the steps is to compute the def-use set for every statement, and using the def-use
[36:21.840 --> 36:27.980]  sets, we computed the LV entry and exit sets, and using the LV entry and exit sets, we figured
[36:27.980 --> 36:33.380]  out the last usage point of every static field reference or local variable, and put it
[36:33.380 --> 36:37.660]  in a summary, and eventually, we used the summary to compute the last usage point for
[36:37.800 --> 36:44.800]  a static field reference at a whole program level. Phew! So, that was too much. I mean,
[36:44.800 --> 36:48.260]  just intuitively, you can see what is the last use point, but to automate that, you
[36:48.260 --> 36:57.900]  require a lot of algorithm and math. So, yeah, I mean, things are not always easy. So, for
[36:57.900 --> 37:03.020]  instance fields, the approach is a little different.
[37:03.020 --> 37:07.900]  What we do is we mark all the classes which have StringBuilder instance fields. We find
[37:07.900 --> 37:13.420]  their object instances. We track the last usage of object instances and their aliases
[37:13.420 --> 37:19.400]  instead of the StringBuilder fields themselves. We will actually see a demo of this again,
[37:19.400 --> 37:25.380]  and I think that would explain this even better. So, let's see, we have three classes
[37:25.380 --> 37:29.440]  here, one of which is main activity. This is where my control is going to start.
[37:29.480 --> 37:35.580]  I have a secret defined here, which is going into a variable my secret,
[37:36.320 --> 37:43.320]  and then I'm instantiating the my class, which is here, and the variable that is referring
[37:43.320 --> 37:49.300]  to the object of my class is mc. Then we have a wrapper class which I'm instantiating, and
[37:49.300 --> 37:58.740]  the reference variable for that is w. And then I have a call to the call w method, which
[37:58.740 --> 38:05.080]  is defined inside the wrapper class. And I'm passing the mc object
[38:05.080 --> 38:11.620]  instance to this call w method, which is invoking set a to pass on the secret value here, and
[38:11.620 --> 38:15.920]  this is just a setter method, which is setting the value of a. So, basically, the secret
[38:15.920 --> 38:21.900]  is here, then it goes here, and then it gets defined in this, and then it gets set in the
[38:21.900 --> 38:28.940]  setter method to the instance field a, which is a private instance field. So,
[38:28.940 --> 38:34.340]  if we had to reset this private instance field, we cannot just do it from outside,
[38:35.080 --> 38:40.580]  from within the on create method. We need to invoke a method within the class to reset
[38:40.580 --> 38:47.520]  the value of a, because the access modifier is private, so we can't,
[38:47.520 --> 38:59.600]  you know, access it outside its class. So, let's just quickly jump on to the demo. So,
[38:59.600 --> 39:04.180]  the steps that I had explained on the previous slides were we mark all classes which have
[39:04.180 --> 39:11.060]  StringBuilder instance fields, which will be my class. We find the object instances,
[39:11.060 --> 39:18.100]  which is MC. Then we track the last uses of object instances, which is here in the call
[39:18.100 --> 39:25.620]  W statement. And then we instrument the code right after the call W statement. And here,
[39:25.620 --> 39:32.280]  you can see that I'm invoking a method, which is reset SBA. And this reset SBA is being
[39:32.280 --> 39:38.040]  defined in the class right here. That's because I cannot access the private instance field
[39:38.040 --> 39:46.620]  small a within the on create method. So, I have to invoke this reset method. All right.
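The four steps for instance fields can be sketched as a small model. This is a sketch under assumptions, not Androsia's code: classes are plain dictionaries of field types, the method body is a straight-line list of (source text, variables referenced) pairs, the alias map is supplied by hand, and the generated `reset_<field>()` calls are hypothetical names standing in for the reset methods shown on the slide.

```python
SENSITIVE_TYPE = "StringBuilder"

# Step 1 input: classes and the declared types of their instance fields (assumed shapes).
CLASSES = {
    "MainActivity": {},
    "MyClass": {"a": SENSITIVE_TYPE, "b": SENSITIVE_TYPE},
    "Wrapper": {"sb": SENSITIVE_TYPE},
}

# Straight-line body of the entry method: (source text, variables referenced).
BODY = [
    ("mySecret = ...", {"mySecret"}),
    ("mc = new MyClass()", {"mc"}),
    ("w = new Wrapper()", {"w"}),
    ("w.callW(mc, mySecret)", {"w", "mc", "mySecret"}),
    ("print(w.sb)", {"w"}),
]
INSTANCES = {"mc": "MyClass", "w": "Wrapper"}  # step 2: object instances
ALIASES = {}  # no aliases in this example


def tracked_classes(classes):
    """Step 1: keep classes that declare at least one StringBuilder field."""
    tracked = {}
    for name, fields in classes.items():
        sensitive = [f for f, ftype in fields.items() if ftype == SENSITIVE_TYPE]
        if sensitive:
            tracked[name] = sensitive
    return tracked


def instrument(body, instances, aliases, tracked):
    """Steps 3 and 4: find each instance's last use, then insert reset calls after it."""
    last = {}
    for idx, (_, refs) in enumerate(body):
        for var in instances:
            if refs & ({var} | aliases.get(var, set())):
                last[var] = idx  # later statements overwrite earlier ones
    out = []
    for idx, (text, _) in enumerate(body):
        out.append(text)
        for var, cls in instances.items():
            if last.get(var) == idx and cls in tracked:
                out.extend(f"{var}.reset_{field}()" for field in tracked[cls])
    return out


for line in instrument(BODY, INSTANCES, ALIASES, tracked_classes(CLASSES)):
    print(line)
```

The reset calls go through the object reference rather than touching the field directly, which reflects the point that a private field can only be cleared by a method of its own class.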
[39:46.620 --> 39:53.180]  So, let's have a look at the demo. Pretty much the same code. I have
[39:53.180 --> 39:59.760]  the call W statement here where I'm passing MC and my secret. And I also have the wrapper
[39:59.760 --> 40:07.280]  class here instantiated. And the reference variable is W. And I'm printing the W.SB field.
[40:07.280 --> 40:14.280]  So, the SB field is another StringBuilder instance field of the wrapper class. And A and B, we can
[40:14.280 --> 40:22.020]  see that. So, A and B will be instance fields of my class. And SB will be an instance field
[40:22.020 --> 40:33.260]  in the wrapper class. We'll see that just in a while. So, you can see that we have an
[40:33.260 --> 40:40.140]  instance field SB here in the wrapper class. And I'm calling set A and passing on the secret's
[40:40.140 --> 40:46.220]  value to the set A method. And my class has A and B as its instance fields. And inside
[40:46.220 --> 40:53.380]  the set A method, I'm setting the value of the secret to A. B is nowhere used right now.
[41:00.820 --> 41:05.640]  So, inside the wrapper class, you see that the method call W also has a secret defined
[41:05.640 --> 41:19.650]  called password. And its last use is inside the call W method. So, B is used here. MC
[41:19.650 --> 41:26.250]  is last used here. And W is last used here. So, the instrumentation point for A and B
[41:26.250 --> 41:31.730]  should be right after the call W statement here. And the instrumentation point for the
[41:31.730 --> 41:40.950]  StringBuilder SB should be right after the W.SB statement here. So, you see here that
[41:40.950 --> 41:47.010]  there's a call W method being invoked. And I have the reset methods being injected or
[41:47.010 --> 41:52.830]  instrumented right after the call W method. Because we were tracking MC. And MC's last
[41:52.830 --> 41:59.030]  use was in the call W method call. And after that, we have reset both the instance fields.
[41:59.030 --> 42:06.570]  And similarly, for the wrapper class, we had an instance field SB which is being reset right
[42:06.570 --> 42:22.630]  here in statement number 58. So, these reset methods are defined in the respective classes
[42:22.630 --> 42:31.430]  which have those instance fields. So, if we go to the respective classes, this is the wrapper
[42:31.430 --> 42:36.310]  class and the reset method has been defined here. What it's doing is the same thing. It's
[42:36.310 --> 42:40.690]  calling the length function on the string builder. And then it's calling the delete
[42:40.690 --> 42:51.380]  API to remove the content of the string builder. And similarly, we have reset methods for variable
[42:51.380 --> 42:59.800]  A and B, which were the instance fields in my class. Here we have reset SBB, which is
[42:59.800 --> 43:08.360]  the reset method for variable B. So, you know, we've tackled all the three scopes now. We
[43:08.360 --> 43:13.820]  know how to deal with instance fields. We know how to deal with static fields. We also
[43:13.820 --> 43:19.120]  know how to deal with local variables. So, I mean, we did not need to see all of this
[43:19.120 --> 43:22.940]  because this is too much in depth. But if you want to go ahead and implement your own
[43:22.940 --> 43:28.000]  static code analysis, this is important stuff for you. So, I mean, I thought it appropriate
[43:28.000 --> 43:35.900]  to be, you know, displayed here. So, the work in progress is basically we're working on
[43:35.900 --> 43:41.080]  test development so that we can, you know, get rid of the remaining bugs that we have.
[43:41.080 --> 43:47.020]  And we're also planning to include this into the CICD pipeline, which should be straightforward.
[43:47.280 --> 43:52.680]  And that way, a lot of big companies could, you know, just plug in their APKs and just
[43:52.680 --> 43:57.480]  push it across to us. And then we can just analyze those APKs and instrument the code
[43:57.480 --> 44:06.320]  that Androsia, you know, instruments to clear the memory contents of the APK. So, that way,
[44:06.320 --> 44:11.920]  it will be helpful for big corporates and especially the enterprise mobility management
[44:11.920 --> 44:18.360]  companies. So, if you want to use Androsia or you want to contribute or get in touch,
[44:18.360 --> 44:22.980]  this is the URL you should take a note of. And the tool and documentation will be available
[44:22.980 --> 44:29.100]  here. And if you want to contribute to the tool, here's my Twitter account or email.
[44:29.100 --> 44:42.120]  You can just shoot out an email to me. All right. So, these are some good references.
[44:42.160 --> 44:47.200]  The first three will get you started when it comes to creating your own data flow
[44:47.200 --> 44:54.300]  analysis. And the last two are a little bit more in-depth. If you want to, you know,
[44:54.300 --> 44:59.320]  see how things are actually working in the background. I'll anyways post these slides
[44:59.320 --> 45:09.660]  on the website of DEF CON. So, you can have a look at these references. Cool. So, that
[45:09.660 --> 45:15.500]  brings me to the end of my talk. Thank you for your time. I'm really happy to speak over
[45:15.500 --> 45:19.160]  here on this podium. I hope you enjoyed it. Thank you very much.
